CASE^E: A Hierarchical Event Representation for the Analysis of Videos
Abstract
A representational gap exists between low-level measurements (segmentation, object classification, tracking) and high-level understanding of video sequences. In this paper, we propose a novel representation of events in videos to bridge this gap, based on the CASE representation of natural languages. The proposed representation makes three significant contributions over existing frameworks. First, we recognize the importance of causal and temporal relationships between sub-events and extend CASE to allow the representation of temporal structure and causality between sub-events. Second, in order to capture both multi-agent and multi-threaded events, we introduce a hierarchical CASE representation of events in terms of sub-events and case-lists. Last, for purposes of implementation we present the concept of a temporal event-tree, and pose the problem of event detection as subtree pattern matching. By extending CASE, a natural language representation, to the representation of events, the proposed work provides a plausible means of interface between users and the computer. We show two important applications of the proposed event representation: the automated annotation of standard meeting video sequences, and event detection in extended videos of railroad crossings.

Introduction

Human community and society are built upon the ability to share experiences of events. Hence, in the enterprise of machine vision, the ability to represent and share observed events must be one of the ultimate, if most abstract, goals. With computer vision techniques maturing sufficiently to provide reliable low-level descriptions of scenes, the necessity of developing semantically meaningful descriptions of these low-level descriptors is becoming increasingly pressing. In this work, one primary objective is to present a coherent representation of events, as a means to encode the relationships between agents and objects participating in an event.
We also emphasize, in particular, a representation that allows computers to share observations with other computers and also with humans, in terms of events. An event is defined as a collection of actions performed by one or more agents. Agents are animates that can perform actions independently or dependently (e.g. people or robots).

Copyright © 2004, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.

The practical need for formal representation of events is best illustrated through possible applications. These applications include: (1) Surveillance: By definition, surveillance applications require the detection of peculiar events. Event representations can be used for prior definition of what constitutes an interesting event in any given domain, allowing automation of area surveillance; (2) Video Indexing and Event Browsing: Given a query for a certain event (defined in terms of an event representation), similar instances can be retrieved from a database of annotated clips; (3) Annotation: In the spirit of MPEG-7, video sequences may be annotated autonomously based on their content; (4) Domain Understanding: Causality is an abstraction that cannot be directly inferred from a single video sequence. Through the use of event representations, causality can be inferred between events in a single domain (e.g. surveillance of airports) across several extended video sequences for domain understanding.

In the literature, a variety of approaches have been proposed for the detection of events in video sequences. Most of these approaches can be arranged into two categories based on the semantic significance of their representations. This distinction is important, since it determines whether humans can exploit the representation for communication.
Approaches where representations do not take on semantic meaning include Causal events (Brand 1997), Force dynamics (Siskind 2000), Stochastic Context Free Grammars (Bobick and Ivanov 1998), Spatio-temporal Derivatives (Zelnik-Manor and Irani 2001), and geometric properties and appearance (Malliot, Thonnat, and Boucher 2003). While they differ in approach, the representations they employ do not lend themselves directly to interpretation by, or interface with, humans. Learning methods such as Bayesian Networks and Hidden Markov Models (Ivanov and Bobick 2000) have been widely used in the area of activity recognition. A known drawback of learning methods is that they usually require large training sets of events, and variation in data may require complete re-training. Similarly, there is no straightforward method of expanding the domain once training has been completed. On the other hand, semantically significant approaches such as state machines (Koller, Heinze, and Nagel 1991) and PNF Networks (Pinhanez and Bobick 1998) provide varying degrees of representation for the actions and agents involved in the events. What is missing in these representations is coherence in describing low-level measurements as 'events'. Can these representations be used to share knowledge between two systems? Can events be compared on the basis of these representations? How are these representations related to human understanding of events? Can a human communicate his or her observation of an event to a computer or vice versa? By extending automatic generation of a natural language ontology to event representation of a video, a plausible interface between the human and the computer is facilitated. One such natural language representation, called CASE, was proposed by Fillmore (Fillmore 1968) for language understanding.
The basic unit of this representation is a case-frame that has several elementary cases, such as an agentive, an instrumental, and a predicate. Using these case-frames Fillmore analyzed languages, treating all languages generically. However, CASE was primarily used for syntactic analysis of natural languages, and while it provides a promising foundation for event representation it has several limitations for that end. First, since events are typically made up of a hierarchy of sub-events, it is impossible to describe them as a succession of case-frames. Second, these sub-events often have temporal and causal relationships between them, and CASE provides no mechanisms to represent these relationships. Furthermore, there might be simultaneous dependent or independent sub-events with multiple agentives, and changes of location and instrumentals during events. CASE was first investigated for event representation in (Neumann 1989), but that work did not examine the temporal structure of events, as the author was not concerned with event detection. More recently, (Kojima, Tamura, and Fukunaga 2001) addressed some shortcomings of CASE for single-person event detection with SO (source, prefixed to a case), GO (goal, prefixed to a case), and SUB (a child frame describing a sub-event). SO and GO are prefixed to the LOC (locative) case, mostly describing the source and destination locations of the agent in the event. A concept hierarchy of action rules (case-frames) was used to determine an action grammar (ontology) for the sequence of events. Also, using case-frames based on events, they reconstructed the event sequence in the form of sentences. Their method worked well for single-person action analysis using the CASE representation. However, this work did not address the important issues of temporal and causal relationships. Moreover, no mechanisms were proposed for multiple agents or multi-threaded events.
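As a rough illustration of the case-frame idea described above, the following Python sketch models a frame as a predicate plus named cases, with an optional single SUB child in the style attributed to Kojima et al. The class name `CaseFrame`, the `render` method, the hyphenated labels `SO-LOC` and `GO-LOC`, and the walking example are all assumptions made for illustration; none of them are taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class CaseFrame:
    """Hypothetical case-frame: a predicate, named cases, and one optional SUB child."""
    predicate: str                                        # PRED case
    cases: Dict[str, str] = field(default_factory=dict)   # e.g. AG, SO-LOC, GO-LOC (assumed labels)
    sub: Optional["CaseFrame"] = None                     # SUB: a single child frame

    def render(self) -> str:
        """Write the frame in the bracketed notation used in the text."""
        parts = [f"PRED: {self.predicate}"]
        parts += [f"{name}: {value}" for name, value in self.cases.items()]
        if self.sub is not None:
            parts.append(f"SUB: {self.sub.render()}")
        return "[ " + ", ".join(parts) + " ]"


# Assumed example: one agent walking from a source location to a goal location.
walk = CaseFrame("walk", {"AG": "person1", "SO-LOC": "door", "GO-LOC": "desk"})
print(walk.render())
# prints: [ PRED: walk, AG: person1, SO-LOC: door, GO-LOC: desk ]
```

Because `sub` holds at most one child and every case holds a single value, this sketch shares the limitations noted above: it cannot express several agents in one case or several concurrent sub-events.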
We propose three critical extensions to CASE for the representation of events: (1) accommodating multiple agents and multi-threaded events, (2) supporting the inclusion of temporal information or temporal logic in the representation, and (3) supporting the inclusion of causal relationships between events. We also propose a novel event-tree representation, based on temporal relationships, for the detection of events in video sequences. Hence, unlike almost all previous work, we use both temporal structure and an environment descriptor simultaneously to represent an event.

The Extended CASE framework: CASE^E

In this section, the three extensions to the CASE framework are presented. First, in order to capture both multi-agent and multi-thread events, we introduce a hierarchical CASE representation of events in terms of sub-events and case-lists. Second, since the temporal structure of events is critical to understanding and hence representing events, we introduce temporal logic into the CASE representation based on the interval algebra in (Allen and Ferguson 1994). Last, we recognize the importance of causal relationships between sub-events and extend CASE to allow the representation of such causality between sub-events.

Multi-Agent, Multi-Thread Representation

Except in constrained domains, events typically involve multiple agents engaged in several dependent or independent actions. Thus any representation of events must be able to capture the composite nature of real events. To represent multiple objects, we introduce the idea of case-lists of elements for a particular case. For example, if more than one agent is involved in an event, we add all of them to a case-list within AG:

    [ PRED: move, AG: { person1, person2 }, ... ]

As in (Kojima, Tamura, and Fukunaga 2001), we use SUB to represent a sub-event that occurs during an event. However, this representation offers no means to represent several sub-events or multiple threads.
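The case-list idea can be sketched in a few lines of Python. This is only one possible encoding, assuming a frame is stored as a plain dictionary whose case values are either a single name or a list of names; the helpers `as_case_list` and `agents` are hypothetical and do not appear in the paper.

```python
from typing import Dict, List, Union

# A case value is either a single element or a case-list of elements (assumed encoding).
CaseValue = Union[str, List[str]]


def as_case_list(value: CaseValue) -> List[str]:
    """Normalize a case value so single elements and case-lists are handled alike."""
    return [value] if isinstance(value, str) else list(value)


def agents(frame: Dict[str, CaseValue]) -> List[str]:
    """Return every agent named in the AG case, or an empty list if AG is absent."""
    return as_case_list(frame.get("AG", []))


# The multi-agent example from the text, plus a single-agent frame for comparison.
move_together = {"PRED": "move", "AG": ["person1", "person2"]}
move_alone = {"PRED": "move", "AG": "person1"}

print(agents(move_together))  # prints: ['person1', 'person2']
print(agents(move_alone))     # prints: ['person1']
```

Normalizing every case to a list keeps downstream code uniform: a single agent is simply a case-list of length one.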
To represent multiple threads we add them to a list of sub-events in the SUB case. An example is shown below:

"While Jack stole from the cashier, Bonnie robbed from the bank as Clyde was talking to the cashier"

    [ PRED: steal, AG: Jack, D: cashier, SUB: {
        [ PRED: rob, AG: Bonnie, OBJ: bank ],
        [ PRED: talk, AG: { Clyde, cashier } ] } ]
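The hierarchical example above can be mirrored by a small recursive data structure. The sketch below is an illustration under assumptions, not the authors' implementation: the `Event` class, its field names, and the `threads` traversal are hypothetical, and every case is stored as a list so that single elements and case-lists look the same.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Tuple


@dataclass
class Event:
    """Hypothetical hierarchical frame: every case is a case-list, SUB is a list of frames."""
    pred: str
    cases: Dict[str, List[str]] = field(default_factory=dict)
    sub: List["Event"] = field(default_factory=list)


def threads(event: Event, depth: int = 0) -> Iterator[Tuple[int, str, List[str]]]:
    """Walk the hierarchy depth-first, yielding (depth, predicate, agents) per frame."""
    yield depth, event.pred, event.cases.get("AG", [])
    for child in event.sub:
        yield from threads(child, depth + 1)


# The Jack / Bonnie / Clyde example from the text.
robbery = Event(
    "steal",
    {"AG": ["Jack"], "D": ["cashier"]},
    sub=[
        Event("rob", {"AG": ["Bonnie"], "OBJ": ["bank"]}),
        Event("talk", {"AG": ["Clyde", "cashier"]}),
    ],
)

for depth, pred, ags in threads(robbery):
    print("  " * depth + f"{pred}: {', '.join(ags)}")
```

Storing SUB as a list lets one parent event carry any number of concurrent threads, and a child can itself carry a case-list of agents, as the `talk` frame does here.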
Publication date: 2004